ABSTRACT

This paper contributes to the literature on voice recognition in the context of a
non-English language. Specifically, it aims to validate the techniques used to
present the basic characteristics of speech, viz. voiced and unvoiced,
that need to be evaluated when analysing speech signals. Zero Crossing Rate
(ZCR) and Short Time Energy (STE) are used in this paper to perform signal
pre-processing of continuous Malay speech to separate the voiced and
unvoiced parts. The study is based on non-real-time data taken from
a collection of audio speeches. The signal is assessed using
ZCR and STE for comparison purposes. The results reveal that ZCR is
low for voiced parts and high for unvoiced parts, whereas STE is high for
voiced parts and low for unvoiced parts. Thus, these two techniques can be
used effectively to separate the voiced and unvoiced parts of continuous
Malay speech.

1. INTRODUCTION

Speech technology has become popular, as many applications today use speech as a medium to
enhance everyday life [1-2]. Although the terms sound similar, speech recognition is different from
voice recognition [3]. Speech recognition is useful for people with a variety of disabilities, such as those with
physical disabilities who find typing difficult, painful or impossible, and those who have difficulty
recognizing and spelling words, such as people with dyslexia [4]. According to Abdullah et al. [5],
the number of registered individuals with physical disabilities in Malaysia is 153,918 (33.1%), compared to
only 2,725 (0.59%) with speech disorders. This suggests that speech technology can help many people with
physical disabilities, especially smart-wheelchair users, who could use their own voice to ease their
movement from one location to another. Most voice-recognition programs are in English [6], and only
limited research has been conducted on other languages such as Malay.

Malay is the national language of Malaysia and one of the four official languages
of Singapore. Besides these countries, Indonesia, Brunei and southern Thailand also use Malay as
a spoken language, but with different dialects and accents [7]. Wu et al. [8] highlighted that Malay
is a non-tonal language that does not need lexical stress. A set of 37 phonemes is
used as the phonemic representation of Malay: six vowels, 27 consonants, three diphthongs and one
for silence [9-10]. Vowels are classified by vowel backness and vowel height, while diphthongs are grouped
by vowel backness. Consonants, on the other hand, are grouped by manner and place of articulation.
Interestingly, many consonants are pronounced in nearly the same way as in English. In many languages,
the basic speech unit is the syllable or the phoneme [11-13]. Malay, however, is an alphabetic
language with salient syllabic structures [14]. From the phonological perspective, a syllable is made up of a
consonant plus a vowel, or a single vowel, following the maximal onset and minimal coda [15]. Speech can
be classified into voiced and unvoiced [16-18]. The two classes have different characteristics in the time and
frequency domains, which lead to different methods of processing [19-21].

This paper is organized as follows. Section 2 presents a brief overview of the
proposed algorithm, covering short-time energy and zero crossings. Section 3 explains in depth the research
methodology adopted in conducting the experiment. Section 4 discusses the results and, finally, Section 5
draws the conclusion and suggests avenues for further research.


2. TECHNIQUES FOR DETECTING VOICED AND UNVOICED SIGNALS 

This work computes basic speech metrics, namely energy and zero crossings, using
a non-real-time approach, which means that the whole signal is available before it is measured. The
following subsections briefly explain the two techniques used in this work, namely
STE and ZCR.


2.1. Short Time Energy 

The amplitude of unvoiced segments is generally much lower than the amplitude of voiced 
segments. The short-time energy of the speech signal provides a convenient representation that reflects these 
amplitude variations. The short-time energy can be calculated by using (1) [22-24]: 


$$E_n = \sum_{k=-\infty}^{\infty} \left( x[k]\, h[n-k] \right)^2 \qquad (1)$$


It is important to have a short-duration window so that the energy is responsive to rapid amplitude changes.
Unfortunately, a window that is too short will not provide a sufficient average to produce a smooth energy
function. The effect of the window on the time-dependent energy representation can be illustrated by the
properties of two representative windows. The first is the rectangular window, given in (2):

$$h[n] = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

The second is the Hamming window, given in (3):

$$h[n] = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

where N is the window length in samples.


The rectangular window applies equal weight to all the samples in the interval, whereas the
Hamming window gives more weight to the center of the window. If the window size N is small, i.e., on the
order of a pitch period or less, E_n will fluctuate very rapidly depending on the exact details of the
waveform. If N is too large, i.e., on the order of several pitch periods, E_n will change very slowly and
will not adequately reflect the changing properties of the speech signal. This implies that no single value of N
is entirely satisfactory.
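
The definitions in (1) to (3) can be sketched in a few lines of code. The following is a minimal illustration in Python (standard library only), not the MATLAB code used in this work; the function names `rectangular_window`, `hamming_window` and `short_time_energy` and the toy signal are assumptions made for the example.

```python
import math

def rectangular_window(N):
    """Equation (2): weight 1 for 0 <= n <= N-1."""
    return [1.0] * N

def hamming_window(N):
    """Equation (3): 0.54 - 0.46*cos(2*pi*n/(N-1)) for 0 <= n <= N-1."""
    if N == 1:
        return [1.0]  # avoid division by zero for a one-sample window
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def short_time_energy(x, h):
    """Equation (1): E_n = sum_k (x[k] * h[n-k])^2 for n = 0 .. len(x)-1.

    h[n-k] is non-zero only for 0 <= n-k <= N-1, so the sum covers the
    N most recent samples ending at n (fewer at the start of the signal).
    """
    N = len(h)
    energy = []
    for n in range(len(x)):
        total = 0.0
        for k in range(max(0, n - N + 1), n + 1):
            total += (x[k] * h[n - k]) ** 2
        energy.append(total)
    return energy

# Toy signal (hypothetical): a loud stretch followed by a quiet stretch.
x = [1.0] * 8 + [0.1] * 8
energy = short_time_energy(x, rectangular_window(4))
print(energy[7], energy[15])  # energy at the end of the loud and quiet stretches
```

With the rectangular window, the energy at the end of the loud stretch is much larger than at the end of the quiet stretch, which is the amplitude contrast that STE is meant to expose. Replacing `rectangular_window(4)` with `hamming_window(4)` weights the middle of each window more heavily.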


2.2. Zero Crossings 

For discrete-time signals, a zero crossing occurs if successive samples have different algebraic
signs. The rate at which zero crossings occur is a simple measure of the frequency content of a signal; this is
particularly true of narrowband signals. For example, a sinusoidal signal of frequency f_0, sampled at a rate of F_s,
has F_s/f_0 samples per cycle of the sine wave. Each cycle has two zero crossings, so the long-time
average rate of zero crossings is given by (4):

$$Z = \frac{2 f_0}{F_s} \ \text{crossings per sample} \qquad (4)$$

Thus, the average zero-crossing rate provides a reasonable way of estimating the frequency of
a sine wave. The required computation is defined in (5) to (7):
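$$Z_n = \sum_{m=-\infty}^{\infty} \left| \operatorname{sgn}(x[m]) - \operatorname{sgn}(x[m-1]) \right| w[n-m] \qquad (5)$$

where

$$\operatorname{sgn}(x[n]) = \begin{cases} 1, & x[n] \ge 0 \\ -1, & \text{otherwise} \end{cases} \qquad (6)$$

and

$$w[n] = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$
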
The model for speech production suggests that the energy of voiced speech is concentrated below 
3 to 4 kHz, while for unvoiced speech, the energy is found at higher frequencies. There is a strong correlation 
between zero-crossing rate and energy distribution with frequency. If the zero-crossing rate is high, 
the speech signal is unvoiced and if the zero-crossing rate is low, the speech signal is voiced. 
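
As a rough check on (4) to (7), the zero-crossing rate of a pure sine wave can be computed and compared with 2 f_0 / F_s. The sketch below is written in Python rather than the MATLAB used in this work; the sampling rate, the sine frequency, the small phase offset (which keeps samples away from exact zeros) and the handling of the first sample are illustrative assumptions.

```python
import math

def sgn(value):
    """Equation (6): +1 for value >= 0, otherwise -1."""
    return 1 if value >= 0 else -1

def zero_crossing_rate(x, N):
    """Equation (5) with the window of equation (7), w[n] = 1/(2N) for 0 <= n <= N-1.

    Returns Z_n for every n. No crossing is counted at m = 0, i.e. x[-1] is
    treated as having the same sign as x[0] (an assumption of this sketch).
    """
    diffs = [0] + [abs(sgn(x[m]) - sgn(x[m - 1])) for m in range(1, len(x))]
    rates = []
    for n in range(len(x)):
        start = max(0, n - N + 1)
        # Each crossing contributes |1 - (-1)| = 2, so 1/(2N) scales the sum
        # to "crossings per sample" over the last N samples.
        rates.append(sum(diffs[start:n + 1]) / (2.0 * N))
    return rates

Fs = 8000.0  # sampling rate in Hz (illustrative value)
f0 = 500.0   # sine frequency in Hz (illustrative value)
x = [math.sin(2.0 * math.pi * f0 * n / Fs + 0.1) for n in range(1600)]

Z = zero_crossing_rate(x, N=800)
expected = 2.0 * f0 / Fs  # equation (4)
print(Z[-1], expected)
```

Once the window is fully inside the signal, the measured rate agrees with (4). A higher-frequency or noise-like input would give a larger value, which is the basis for treating a high zero-crossing rate as an indicator of unvoiced speech.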


3. RESEARCH METHODOLOGY 

MATLAB 2014a is used in this work. MATLAB is chosen as it offers many advantages. It contains 
a variety of signal processing and statistical tools, which help users in generating a variety of signals and 
plotting them. MATLAB excels at numerical computations, especially when dealing with vectors or matrices 
of data [25-26]. 

In this work, four respondents read different Malay texts acquired from local news
websites. The recordings are taken from the speech corpus developed from a collection of audio speeches by Tan et al. [9].
The creation of such corpora is necessary, especially for low-resourced languages [27]. The methodology used
in this work is shown in Figure 1.


[Figure 1: block diagram. Input speech signal (audioread()) -> Generate signal length (N = length(x)) -> Generate time for signal (t = n*(1/Fs)) -> Define the window (wintype = 'rectwin') -> Find STE and ZCR rates -> Plot graphs for STE and ZCR]

Figure 1. Block diagram for the research


The selected audio is read using the audioread() function. Then, the signal length is identified and the
time axis for the signal is generated using the formula t = n * (1/Fs), where n is the sample index and Fs is
the sampling rate. In this work, the rectangular window is
chosen because it applies equal weight to all the samples within the interval. The ZCR and STE are
then calculated. Lastly, the graphs for both ZCR and STE are plotted.
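
The steps above can be sketched end to end on a synthetic signal. The following Python example is only an illustration under stated assumptions, not the MATLAB script used in this work: a low-frequency sine stands in for a voiced segment, low-amplitude random noise stands in for an unvoiced segment, the frame length of 200 samples is arbitrary, and `frame_features` is a hypothetical helper that uses non-overlapping rectangular frames. Reading a real file and plotting are omitted.

```python
import math
import random

def frame_features(x, frame_len):
    """Split x into non-overlapping rectangular frames; return (STE, ZCR) per frame."""
    ste, zcr = [], []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        # Short-time energy: sum of squared samples in the frame.
        ste.append(sum(s * s for s in frame))
        # Zero crossings: successive samples with different algebraic signs.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        zcr.append(crossings / frame_len)
    return ste, zcr

Fs = 8000               # sampling rate in Hz (illustrative value)
rng = random.Random(0)  # fixed seed so the example is repeatable

# Synthetic stand-ins (assumptions): 0.5 s "voiced" tone, then 0.5 s "unvoiced" noise.
voiced = [0.8 * math.sin(2.0 * math.pi * 150.0 * n / Fs + 0.1) for n in range(Fs // 2)]
unvoiced = [0.05 * rng.uniform(-1.0, 1.0) for _ in range(Fs // 2)]
x = voiced + unvoiced

N = len(x)                              # signal length
t = [n * (1.0 / Fs) for n in range(N)]  # time axis, t = n * (1/Fs)

ste, zcr = frame_features(x, frame_len=200)
half = len(ste) // 2  # first half of the frames is "voiced", second half "unvoiced"
print("voiced   STE min:", min(ste[:half]), " ZCR max:", max(zcr[:half]))
print("unvoiced STE max:", max(ste[half:]), " ZCR min:", min(zcr[half:]))
```

On this synthetic input, the tone frames show high STE and low ZCR, and the noise frames show the opposite. Real speech is less clear-cut, so any threshold placed between the two groups would need to be tuned on recorded data.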


4. RESULTS AND DISCUSSION 

The demographic profiles of the four respondents involved in this work are shown in Table 1.
Respondents from different ethnic groups were chosen to provide evidence that the two proposed techniques can
detect the voiced and unvoiced parts of continuous Malay speech.

Table 1. Demographic Profiles

Gender | Ethnic Group | Speech Uttered
Female | Chinese | Perkara ini berlaku kepada pelakon Eizlan Yusuf yang berlakon dalam filem Panas yang menampilkan terma seksualiti dalam masyarakat
Female | Indian | Pun begitu, Sarayala tidak sempat menunaikan hajatnya untuk makan kari ayam campur ubat, pisang dan sambal petai
Male | Malay | Pada Sukan SEA Chiengmai di Thailand 1995, snoker menyumbang enam pingat emas
Male | Others | Kami diberitahu Erra dan Yusri sah akan hadir pada hari tayangan yang ditetapkan pada 10 September pukul 9 malam


Table 2 shows the results for the speech uttered by the four respondents listed in Table 1. The results
reveal that ZCR is low for the voiced parts, whereas STE is high for the
voiced parts, and vice versa for the unvoiced parts.


Table 2. Results of ZCR and STE for Continuous Speech Uttered by Different Ethnic Groups

[Table 2: ZCR and STE plots for each respondent (Female, Chinese; Female, Indian; Male, Malay; Male, Others). The plots are not reproduced here.]

One word from the continuous speech uttered by the Chinese female respondent has been selected, as shown by
the black dotted box in Figure 2 for ZCR and Figure 3 for STE. Figure 4 shows the word “perkara” segmented
into its syllables “per”, “ka” and “ra” using PRAAT. The region highlighted in yellow
is the unvoiced signal. As can be seen from the figures, the yellow part is high for ZCR and low for STE.

Figure 2. The dotted box contains the word “perkara” from the ZCR calculation

Figure 3. The dotted box contains the word “perkara” from the STE calculation

Figure 4. PRAAT is used to extract the word “perkara”


5. CONCLUSION

The two proposed techniques serve as a reasonable tool for analysing Malay speech
audio signals to determine the voiced and unvoiced parts. This stage is important for removing any
noise before feeding the required speech signal to feature extraction. Future work may study
the variations in the mean pitch and intensity of the Malay speech uttered by each ethnic group
due to differences in their mother tongue.